
batches author lookups in imports#3126

Open
gabestein wants to merge 2 commits into main from gs/batch-author-import

Conversation


@gabestein gabestein commented Jul 11, 2024

Issue(s) Resolved

Resolves #2795

Test Plan

Screenshots (if applicable)

Optional

Notes/Context/Gotchas

Supporting Docs

@pubpubBot pubpubBot temporarily deployed to pubpub-pipel-gs-batch-a-znoshx July 11, 2024 20:27 Inactive
@pubpubBot pubpubBot temporarily deployed to pubpub-pipel-gs-batch-a-znoshx July 12, 2024 19:22 Inactive
@gabestein gabestein closed this Apr 15, 2026
@gabestein gabestein reopened this Apr 15, 2026
@isTravis isTravis force-pushed the gs/batch-author-import branch from 80face9 to 1d6ffa3 Compare April 15, 2026 17:40
@isTravis

Pushed a commit with an updated approach that adapts to the changes made to main since this PR was opened.

Problem

When importing a document with many authors, getAttributions in workers/tasks/import/metadata.ts fires N individual getSearchUsers database queries (one per author name) via Promise.all. Each query runs an ILIKE '%name%' scan on the User table. With many authors this causes DB connection contention and can exceed the 2-minute worker timeout.
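To make the failure mode concrete, here is a minimal sketch of the pre-fix pattern. The names (`fakeSearchUsers`, `lookupAuthors`) are illustrative stand-ins, not the actual PubPub code; the point is that `Promise.all` fires one query per author with no limit on concurrency:

```typescript
// Hypothetical sketch of the pre-fix pattern: one DB round trip per author.
// fakeSearchUsers stands in for the real getSearchUsers ILIKE '%name%' scan.
type User = { id: string; fullName: string };

let queryCount = 0;
const fakeSearchUsers = async (name: string): Promise<User[]> => {
  queryCount += 1; // each call would be a separate ILIKE scan on the User table
  return name === "Ada Lovelace" ? [{ id: "u1", fullName: "Ada Lovelace" }] : [];
};

// N authors => N queries, all fired at once via Promise.all
const lookupAuthors = async (names: string[]) =>
  Promise.all(
    names.map(async (name) => ({ name, users: await fakeSearchUsers(name) })),
  );
```

With 50 authors this issues 50 simultaneous queries, which is where the connection contention comes from.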

Original PR Approach (July 2024)

The original branch (gs/batch-author-import) had two commits:

  1. Throttled queries with asyncMap — Replaced Promise.all + .map(async ...) with asyncMap at concurrency: 2, so only 2 author lookups run at a time instead of all at once.
  2. Bumped the global worker timeout — Changed maxWorkerTimeSeconds from 120 → 300, affecting all worker tasks (export, import, archive).

This reduced DB pressure but still issued N individual queries — just slower. And the timeout change was a blunt instrument.
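For reference, a concurrency-limited map like the one the original branch used can be sketched as below. This is a minimal stand-in for the codebase's `asyncMap` utility (its exact signature is an assumption here); it caps in-flight lookups at `concurrency` but still performs one query per item:

```typescript
// Minimal sketch of a concurrency-limited map, standing in for the
// codebase's asyncMap utility (signature assumed, not verified).
const asyncMap = async <T, R>(
  items: T[],
  fn: (item: T) => Promise<R>,
  opts: { concurrency: number },
): Promise<R[]> => {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker pulls the next unclaimed index until the list is drained,
  // so at most `concurrency` calls to fn are in flight at once.
  const worker = async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  };
  const workerCount = Math.min(opts.concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
};
```

At `concurrency: 2`, 50 authors still means 50 sequential-ish queries, just two at a time, which is why this helped with contention but not with total latency.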

Updated Approach (April 2026)

Three targeted changes that batch at the SQL level rather than throttle at the application level:

1. New batchSearchUsers function — server/search/queries.ts

Added a new export that takes an array of author names and issues a single User.findAll query with all names combined into one OR clause. Results are mapped back per-name in JS using the same matching logic as the original per-name query (fullName contains, slug contains, email exact match). The existing getSearchUsers is left untouched for use by the search API.
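The per-name fan-out can be sketched as follows. The shapes and names here are assumptions based on the description above, not the actual `server/search/queries.ts` code; the Sequelize query itself is shown only as a comment since it needs a live DB:

```typescript
// Sketch of the batch lookup's JS-side fan-out (names and shapes assumed).
// The single findAll would combine all names into one OR clause, roughly:
//   User.findAll({ where: { [Op.or]: names.flatMap((n) => [
//     { fullName: { [Op.iLike]: `%${n}%` } },
//     { slug: { [Op.iLike]: `%${n}%` } },
//     { email: n },
//   ]) } })
// The pooled rows are then mapped back per name with the same criteria as
// the old per-name query: fullName contains, slug contains, email exact.
type User = { fullName: string; slug: string; email: string };

const matchUsersToName = (users: User[], name: string): User[] => {
  const needle = name.toLowerCase();
  return users.filter(
    (u) =>
      u.fullName.toLowerCase().includes(needle) ||
      u.slug.toLowerCase().includes(needle) ||
      u.email.toLowerCase() === needle,
  );
};

const fanOut = (users: User[], names: string[]): User[][] =>
  names.map((name) => matchUsersToName(users, name));
```

Because the matching criteria are a superset check run in JS, a row matched by the OR clause for one name never leaks into another name's results.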

2. Rewritten getAttributions — workers/tasks/import/metadata.ts

Instead of Promise.all over N async lookups, the function now:

  • Filters authorEntries to collect all string names upfront
  • Calls batchSearchUsers once with the full list
  • Synchronously maps results back into the attribution objects

No concurrency management needed since it's a single query.
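The three steps above can be sketched as one small function. The entry shape and the `batchSearchUsers` signature are assumptions for illustration, not the actual `metadata.ts` code:

```typescript
// Sketch of the rewritten flow (entry shape and signatures are assumptions).
// One batched lookup replaces N concurrent queries; mapping back is synchronous.
type AuthorEntry = string | { name: string };
type User = { id: string; fullName: string };

const getAttributions = async (
  authorEntries: AuthorEntry[],
  batchSearchUsers: (names: string[]) => Promise<Map<string, User[]>>,
) => {
  // 1. Collect all string names upfront
  const names = authorEntries.map((e) => (typeof e === "string" ? e : e.name));
  // 2. One round trip for the full list
  const matches = await batchSearchUsers(names);
  // 3. Synchronously map results back into attribution objects
  return names.map((name) => ({ name, user: matches.get(name)?.[0] ?? null }));
};
```

Since there is exactly one awaited call regardless of author count, there is nothing left for `asyncMap` or `Promise.all` to throttle.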

3. Per-task timeout — workers/queue.ts

Added import: 300 (5 minutes) to the existing customTimeouts map. This gives imports more headroom for large documents without affecting other task types. The codebase already had this mechanism in place for the archive task.
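A sketch of how that mechanism plausibly looks, assuming the names given in this PR (`customTimeouts`, `maxWorkerTimeSeconds`) and using an illustrative value for the pre-existing `archive` entry:

```typescript
// Sketch of the per-task timeout lookup (names assumed from the PR
// description; the archive value shown here is illustrative only).
const maxWorkerTimeSeconds = 120; // global default for worker tasks

const customTimeouts: Record<string, number> = {
  archive: 300, // pre-existing entry demonstrating the mechanism
  import: 300, // added by this PR: 5 minutes of headroom for large imports
};

const getTaskTimeout = (taskType: string): number =>
  customTimeouts[taskType] ?? maxWorkerTimeSeconds;
```

Unlisted task types (export, etc.) keep the 120-second default, which is what makes this narrower than the original branch's global bump.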

Key Insight

One SQL query with 50 names in its WHERE clause is far cheaper than 50 individual queries run 2 at a time. Batching at the database level eliminates the need for application-level concurrency control entirely.

@isTravis isTravis marked this pull request as ready for review April 15, 2026 17:50


Development

Successfully merging this pull request may close these issues.

Cannot import LaTeX with many authors
